AITopics | tight regret bound

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Neural Information Processing Systems | Dec-25-2025, 03:51:07 GMT

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies, i.e., acting by 1-step planning, can achieve tight minimax performance in terms of regret, $O(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, which is then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
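The difference between full-planning and 1-step planning can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name, the array layout (`P_hat`, `R_hat`, `bonus`, `V`) and the `step_env` callback are all assumptions made for illustration. At each step only the currently visited state is backed up, which costs about $S \cdot A$ operations, whereas a full-planning method would back up all $S$ states.

```python
import numpy as np

def greedy_one_step_episode(P_hat, R_hat, bonus, V, s0, step_env, H):
    """Run one episode acting greedily w.r.t. a 1-step lookahead on V.

    Hypothetical layout (an assumption, not taken from the paper):
      P_hat: (S, A, S) empirical transition probabilities
      R_hat: (S, A) empirical mean rewards
      bonus: (S, A) exploration bonus
      V:     (H + 1, S) optimistic value estimates, updated in place
    """
    s = s0
    trajectory = []
    for h in range(H):
        # 1-step planning: back up only the current state (O(S * A) work).
        q = R_hat[s] + bonus[s] + P_hat[s] @ V[h + 1]
        a = int(np.argmax(q))
        # Values only ever decrease from their optimistic initialization,
        # in the spirit of real-time dynamic programming.
        V[h, s] = min(V[h, s], q[a])
        trajectory.append((s, a))
        s = step_env(s, a)
    return trajectory
```

Initializing `V[h, :]` to an upper bound such as `H - h` keeps the estimates optimistic at the start; the greedy action is then chosen from the lookahead values `q` without ever solving the full empirical model.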

model-based reinforcement learning, name change, tight regret bound, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)

Reviews: Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Neural Information Processing Systems | Jan-22-2025, 09:44:05 GMT

As such, it opens up potential new research approaches along with providing an improvement on the SOTA. Quality: The argument is well-developed, and extensive proofs are provided in the supplementary materials or referenced in existing literature. The greedy approach is directly applied to two existing SOTA full-planning-based algorithms, suggesting it is a generalizable alternative. Clarity: The paper is generally well-organized and clear; the paper gives an intuitive sense of the results, although the bulk of the proofs are confined to the supplementary material. Several scattered clarity issues are described in the detailed comments below.

greedy policy, model-based reinforcement learning, tight regret bound, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Reviews: Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Neural Information Processing Systems | Jan-22-2025, 09:43:55 GMT

All reviews agree that the contribution is novel and strong. The rebuttal gave important answers and we all strongly defend acceptance.

greedy policy, model-based reinforcement learning, tight regret bound

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Neural Information Processing Systems | Oct-9-2024, 17:20:17 GMT

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies, i.e., acting by 1-step planning, can achieve tight minimax performance in terms of regret, $O(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, which is then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.

model-based reinforcement learning, reinforcement learning, tight regret bound, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.65)

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Efroni, Yonathan, Merlis, Nadav, Ghavamzadeh, Mohammad, Mannor, Shie

Neural Information Processing Systems | Mar-19-2020, 01:33:09 GMT

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies, i.e., acting by 1-step planning, can achieve tight minimax performance in terms of regret, $O(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, which is then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.

model-based reinforcement learning, reinforcement learning, tight regret bound, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.65)

Tight Regret Bounds for Noisy Optimization of a Brownian Motion

Wang, Zexin, Tan, Vincent Y. F., Scarlett, Jonathan

arXiv.org Machine Learning | Jan-25-2020

We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise. We show that the smallest possible expected simple regret and the smallest possible expected cumulative regret scale as $\Omega(1 / \sqrt{T \log (T)}) \cap \mathcal{O}(\log T / \sqrt{T})$ and $\Omega(\sqrt{T / \log (T)}) \cap \mathcal{O}(\sqrt{T} \cdot \log T)$, respectively. Thus, our upper and lower bounds are tight up to a factor of $\mathcal{O}( (\log T)^{1.5} )$. The upper bound uses an algorithm based on confidence bounds and the Markov property of Brownian motion, and the lower bound is based on a reduction to binary hypothesis testing.
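The stated $(\log T)^{1.5}$ gap can be checked by dividing each upper bound by the corresponding lower bound. The short calculation below uses only the rates quoted in the abstract, with constants ignored:

```latex
% Simple regret: upper bound divided by lower bound
\frac{\log T / \sqrt{T}}{1 / \sqrt{T \log T}}
  = \log T \cdot \sqrt{\log T}
  = (\log T)^{3/2}

% Cumulative regret: upper bound divided by lower bound
\frac{\sqrt{T} \cdot \log T}{\sqrt{T / \log T}}
  = \log T \cdot \sqrt{\log T}
  = (\log T)^{3/2}
```

Both ratios give the same factor, which is why a single $(\log T)^{1.5}$ gap covers the simple and the cumulative regret.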

artificial intelligence, machine learning, optimization problem, (15 more...)

arXiv:2001.09327

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Tight Regret Bounds for Infinite-armed Linear Contextual Bandits

Li, Yingkai, Wang, Yining, Zhou, Yuan

arXiv.org Machine Learning | May-4-2019

Linear contextual bandits are a class of sequential decision-making problems with important applications in recommendation systems, online advertising, healthcare, and other machine learning related tasks. While there is much prior research, tight regret bounds for linear contextual bandits with infinite action sets remain open. In this paper, we prove a regret upper bound of $O(\sqrt{d^2T\log T})\times \mathrm{poly}(\log\log T)$, where $d$ is the domain dimension and $T$ is the time horizon. Our upper bound matches the previous lower bound of $\Omega(\sqrt{d^2 T\log T})$ up to iterated logarithmic terms.

artificial intelligence, bandit, machine learning, (17 more...)

arXiv:1905.01435

Genre: Research Report (0.50)

Industry: Health & Medicine (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Tight Regret Bounds for Bayesian Optimization in One Dimension

Scarlett, Jonathan

arXiv.org Machine Learning | May-29-2018

We consider the problem of Bayesian optimization (BO) in one dimension, under a Gaussian process prior and Gaussian sampling noise. We provide a theoretical analysis showing that, under fairly mild technical assumptions on the kernel, the best possible cumulative regret up to time $T$ behaves as $\Omega(\sqrt{T})$ and $O(\sqrt{T\log T})$. This gives a tight characterization up to a $\sqrt{\log T}$ factor, and includes the first non-trivial lower bound for noisy BO. Our assumptions are satisfied, for example, by the squared exponential (SE) and Matérn-$\nu$ kernels, with the latter requiring $\nu > 2$. Our results certify the near-optimality of existing bounds (Srinivas et al., 2009) for the SE kernel, while proving them to be strictly suboptimal for the Matérn kernel with $\nu > 2$.

artificial intelligence, bayesian optimization, machine learning, (15 more...)

arXiv:1805.11792

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Szepesvari, Csaba

arXiv.org Artificial Intelligence | Jan-27-2015

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we analyze a UCB-like algorithm for solving the problem, which is known to be computationally efficient; and prove $O(K L (1 / \Delta) \log n)$ and $O(\sqrt{K L n \log n})$ upper bounds on its $n$-step regret, where $L$ is the number of ground items, $K$ is the maximum number of chosen items, and $\Delta$ is the gap between the expected returns of the optimal and best suboptimal solutions. The gap-dependent bound is tight up to a constant factor and the gap-free bound is tight up to a polylogarithmic factor.
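A UCB-like strategy for this setting can be sketched as follows. This is an illustrative sketch, not the paper's exact algorithm: the confidence radius $\sqrt{1.5 \log t / T(e)}$, the function names, and the `top_k_oracle` (which assumes every subset of $K$ items is feasible) are all assumptions. In a constrained problem, the oracle would be replaced by whatever offline optimizer handles the feasible sets.

```python
import numpy as np

def top_k_oracle(ucb, K):
    """Hypothetical oracle: assumes every subset of K items is feasible."""
    return np.argsort(-ucb)[:K]

def ucb_semi_bandit(sample_weights, L, K, n, oracle=top_k_oracle):
    """Run n steps of a UCB-like rule over L ground items.

    sample_weights(chosen) must return one stochastic weight per chosen item.
    """
    counts = np.zeros(L)   # number of observations per item
    means = np.zeros(L)    # running mean weight per item
    history = []
    for t in range(1, n + 1):
        # Assumed confidence radius; unobserved items get an infinite index
        # so that every item is tried at least once.
        with np.errstate(divide="ignore", invalid="ignore"):
            radius = np.sqrt(1.5 * np.log(t) / counts)
        ucb = np.where(counts > 0, means + radius, np.inf)
        chosen = oracle(ucb, K)
        # Semi-bandit feedback: the weight of each chosen item is observed.
        w = sample_weights(chosen)
        counts[chosen] += 1
        means[chosen] += (w - means[chosen]) / counts[chosen]
        history.append(chosen)
    return means, counts, history
```

The payoff at each step is the sum of the observed weights, `w.sum()`, so the $n$-step regret can be tracked by comparing it with the expected payoff of the best feasible set.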

artificial intelligence, combucb1, machine learning, (12 more...)

arXiv:1410.0949

Country: North America > United States > California > Santa Clara County (0.28)

Genre: Research Report (0.50)

Industry: Education (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
